Distribution of Bubble Lengths in DNA 
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The distribution of bubble lengths in double-stranded DNA is presented for segments of varying 
guanine-cytosine (GC) content, obtained with Monte Carlo simulations using the Peyrard-Bishop- 
Dauxois model at 310 K. An analytical description of the obtained distribution in the whole regime 
investigated, i.e., up to bubble widths of the order of tens of nanometers, is available. We find 
that the decay lengths and characteristic exponents of this distribution show two distinct regimes 
as a function of GC content. The observed distribution is attributed to the anharmonic interactions 
within base-pairs. The results are discussed in the framework of the Poland-Scheraga and the 
Peyrard-Bishop (with linear instead of nonlinear stacking interaction) models. 
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Fluctuating thermal openings of the double helix of 
DNA (bubbles) are known to exist in a regime extending 
well below the denaturation transition, which includes 
physiological temperatures. These openings could be ex- 
ploited by regulatoryproteins or functional enzymes to 
perform their work [Ij. It has been recently suggested 
that there is an increased probability for large bubbles 
to appear at functionally relevant sites of gene promoter 
DNA segments 0, More accurate numerical calcu- 
lations 0, Q have unadvertedly reinforced this idea Q , 
by finding an increased opening probability at functional 
binding sites of the promoter region upstream of a gene 
initiation site. Such a picture is in analogy to the func- 
tional role of fluctuations in proteins Q. This has led 
to an increased interest for equilibrium and dynamical 
studies of thermally induced bubbles in DNA. 

In a different context, recent experiments on cycliza- 
tion of short DNA fragments [1, Q have drawn atten- 
tion toward the possibility that regions of very low rigid- 
ity may appear in double stranded DNA [10|. It has 
been proposed [111, [l2| that these bending kinks may be 
caused by the formation of denaturation bubbles, letting 
the higher flexibility of single stranded DNA explain the 
sharp bending observed in experiments Q. In this con- 
text, the question of how frequently denaturation bub- 
bles arise is one of the keys for the understanding of this 
phenomenon. 

Here we present the distribution of bubble lengths, for 
lengths ranging from 0.34 nm (single base-pair openings) 
up to a few tens of nanometers (corresponding to open- 
ings of size several tens of base-pairs). The bubble length 
distributions have been obtained with Monte Carlo simu- 
lations, using the Peyrard-Bishop-Dauxois (PBD) model 
[l3| for describing the energy of base-pair openings. We 
discuss equilibrium distributions at T = 310 K, averaged 
over DNA segments with total size of one thousand base- 
pairs. The bubble length distributions depend on the 
guanine-cytosine (GC) content, i.e., the fraction of GC 
base-pairs in the sequence. The variation of the distribu- 



tion as a function of the GC content is investigated. Note 
that the size of a bubble is characterized by its length 
(width) and its amplitude. Here we flx a threshold for 
the amplitude and when a bubble of length I base-pairs 
(or L = 0.34 X I nm) is mentioned, it means that all 
the I successive base-pairs have openings larger than the 
fixed amplitude threshold ythres, while the limiting base- 
pairs (the first neighbors to the left and to the right of 
the stretch of the I successive base-pairs) have smaller 
openings than that. Periodic boundary conditions are 
used, so our study considers only bubbles in the mid- 
dle of a sequence, disregarding end effects. Fraying, i.e. 
the opening of the DNA molecule starting at one of its 
ends, is a very interesting problem that will be addressed 
elsewhere. 

In this Letter we show that the bubble length distri- 
butions, obtained within the PBD model, follow a non- 
exponential law and, further, an analytical description 
(see Eq. ([2]) below) is available describing the observed, 
slower than exponential, decay with the length. This 
behavior has been quantified for various GC contents 
and remains qualitatively the same independently on the 
value of the fixed amplitude. The fact that the same dis- 
tribution law is also predicted by the completely unre- 
lated Poland-Scheraga Q model of DNA, strongly sug- 
gests that our findings are beyond theoretical speculation 
and might be a proper description of actual DNA physics. 

In recent investigations, numerically exact equilibrium 
properties 0, Q and Langevin dynamics up to nanosec- 
ond time-scales averaged over several hundred realiza- 
tions have been achieved taking advantage of the 
efficiency of the simple PBD model for describing base- 
pair openings in double-stranded DNA. The PBD model 
coarse-grains the relatively rigid internal structure of the 
nucleotides and considers their anharmonic stretching in- 
teractions at the single base-pair level 0, 12 1. Its 
accuracy has been demonstrated by several comparisons 
with different experiments 0, [3, [l3| . 
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The potential energy of the PBD model is given by: 



V = 



(1) 



The sum is over all the base-pairs of the molecule and 
Un denotes the relative displacement from equilibrium at 
the n*'* base pair. The first term is an on-site Morse po- 
tential, representing the hydrogen bonds between bases 
in the same pair as well as other effective interactions 
between complementary nucleotides. The second term 
is an anharmonic coupling between adjacent base pairs 
that models the stacking interaction. The heterogene- 
ity of the genetic sequence is taken into account giving 
different values to the parameters of the Morse poten- 
tial for adenine-thymine (AT) or guanine-cytosine (GC) 
base pairs. The values of the parameters we have used 
[20| were fitted in Ref. to reproduce thermodynamic 
properties of DNA. The same parameters have been sub- 
sequently used, without additional fitting procedures, to 
successfully describe experimental observations [1, [3] . 

To study the statistics of bubbles we have performed 
Monte Carlo simulations of the PBD model using the 
same procedure introduced in Ref. [l^l ■ The Metropolis 
algorithm was used to produce equilibrium configurations 
of the molecule at T = 310 K. Results were averaged, af- 
ter proper thermalization, over several realizations (typi- 
cally 25, each one consisting of 8 • 10^ Monte Carlo steps, 
which makes a total of 2 • 10^ steps) with different initial 
conditions. As we used quite large DNA sequences (con- 
taining 1000 base-pairs) and the temperature studied is 
well below the melting temperature, the probability that 
a complete melting occurred during the time of a simu- 
lation was so low that no melting events (that, for long 
enough simulations, would eventually take place for this 
model at any temperature [3, ISIi ) were observed. We 
show results for bubbles of amplitude equal or greater 
than ythres — 1.5 A. We have also studied bubbles of 
amplitude ythres 0.5 A, 3 Aand 5 A, finding the same 
qualitative results presented here. We have checked that 
our results are independent of the length of the sequences 
studied, provided they are longer than the longest ob- 
served bubble and there is no complete melting. 

Figure[T]shows the bubble length distribution per base- 
pair, P{1), obtained from our Monte Carlo simulations. 
This histogram is defined as P{1) = limi^oo(A^(0)/-^i 
where N(l) is the averaged (during the Monte Carlo sim- 
ulation) number of bubbles of length I base pairs on a 
sequence of total size L, and the average is over different 
realizations of an uncorrelated random sequence with a 
given GC content. The limit L — > oo in the definition 
of P{1) has the meaning that the total size of the DNA 
segment should be sufficiently larger than the considered 
bubble lengths. Under this condition, the distribution 
P{1) should be interpreted as follows: for a given DNA 
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FIG. 1: Distribution per base-pair of bubble sizes I (in number 
of base-pairs, Z > 1), P{1), for random sequences with different 
GC contents (points, as indicated in the label). Also shown 
is the result for a segment from E. coKs gal promoter with a 
37.45% GC percentage (thick line). Thin lines are fits with the 
analytical distribution, Eq.©. T — 310 K, ythres ~ 1.5 A. 
Inset: Probability Pb{l) = IP{1) (for I > 1) that an individual 
baser-pair forms part of a bubble of size I (see text). 



sequence of total length L base-pairs, the quantity P{l)-L 
gives the average number of occurrences of a bubble of 
length / base-pairs, in thermal equilibrium. 

From P{1) we obtain the probability that an indi- 
vidual base-pair forms part of a bubble of length I as 
Pfc(Z) — IP{1) for I > 1, while the probability of not 
belonging to any bubble (i.e., that the base-pair has an 
opening smaller than ythres) is ^6(0) = ^'(O). From the 
definition of P{1) follows that P(0) = 1 - E/>i^-P(0> 
thus ensuring correct normalization Pb{^) = 1 of the 
probability P&. 

As a check of our simulations, we have also computed 
for pure GC and AT sequences the probability of i) a sin- 
gle base-pair to have an opening larger than the consid- 
ered fixed amplitude and ii) two neighboring base-pairs 
to simultaneously have openings larger than the fixed 
amplitu de, using the transfer integral operator method 
241 . The probabilities calculated in these two 
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cases, using a different but numerically exact method, 
are in a very good agreement (up to numerical accuracy) 
with those obtained through the Monte Carlo simulation 
from the distributions P{1), as Y.i>i^P{^) = 1 - -P(O) 
and X]i>2(' ~ (Oi respectively. 

We find that the distributions per base-pair can be 
fitted in the whole range of bubble sizes studied [25j . 
with a single function of the form: 



P[l) = W- 



-i/i 



for I > 1. 



(2) 



A fit with a stretched exponential function having a 
stretching exponent smaller than 1 is also accurate [26(, 
but we have preferred Eq. ^ because this functional 
form can be derived from existing theories both for a 
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simplified version of the used model 27| (the Peyrard- 
Bishop model, with linear stacking interactions [28| ) and 
for the independent Poland-Scheraga model [l^ . 

In order to investigate which interactions of those 
present in the model are responsible for the observed 
behavior of the distribution, we have performed simi- 
lar calculations varying different terms of the potential 
of Eq.([T]). When p = 0, i.e. the stacking interaction is 
linearized, the distribution remains nonexponential, al- 
though it is described by smaller c, in complete agree- 
ment with previous studies [27| . However, when the on- 
site Morse potential is linearized (substituting the Morse 
potential by its harmonic approximation), then the ob- 
tained distribution is exponential. This happens inde- 
pendently of setting p = or p = 2. Therefore, we 
infer that the nonlinear interactions between the bases 
within base-pairs result in the observed bubble length 
distribution. These anharmonic on-site interactions have 
been found to qualitatively affect other properties of the 
model as well [29] . This kind of interaction appears also 
in modified models where the stacking interaction has a 
qualitatively different shape [s^ . 

In Figure [T] is also shown, for sake of comparison, a 
result for a natural sequence; a segment from Escherichia 
coli's gal promoter 3l|, extending from the position —160 
(upstream) up to the site -|-91 (downstream) around the 
transcription start site of galE gene. This sequence has a 
GC fraction of 37.45%, and, as can be seen from Figure 
[U its results are indistinguishable from those of a random 
sequence with almost the same GC content (37.5%). This 
is expected, as it is in agreement with the known fact (3^ 
that, despite statistical correlations in the sequence and 
local properties, equilibrium averaged physical properties 
of large DNA molecules are basically similar to those of 
random sequences. 

DNA molecules with higher GC content exhibit a faster 
decay (see Figure [T]) , as the stronger bound GC pairs 
make large bubbles less likely. The change of the dis- 
tribution with the GC content is quantified in Figure O 
where the dependence of the parameters of Eq. ^ on 
the GC fraction is shown. All the parameters decrease 
monotonically with the GC percentage. The decay of W 
signifies that the higher the GC content the more diffi- 
cult it is to excite large openings in the double strand, 
therefore in AT rich sequences bubbles have a higher sta- 
tistical weight. The decay length ^ is smaller for GC-rich 
sequences, in which the distribution decays faster. The 
variation of all these parameters can be fitted by a bi- 
linear or a biexponential function (see Figure [2|), reveal- 
ing two distinct regimes on the GC-fraction dependency: 
above and below a GC content of about 40%. Previ- 
ous studies have shown that the nucleation of bubbles 
depends strongly on the sequence: the weaker AT base- 
pairs have to go over the potential barrier imposed by 
their GC neighbors in order to break the bonds. Our 
findings suggest that for GC concentrations over 40%, 
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FIG. 2: Dependence of the exponent c (top), decay length ^ 
(bottom), and preexponential coefficient W (bottom, inset) of 
the distribution, Eq. ((2)1, on the GC content of the sequence 
(circles). Lines show fits of two distinct regimes with linear 
(exponential for ^) functions. T = 310 K, ythres ~ 1-5 A. 
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FIG. 3: Dependence of the average bubble length Lb, Eq. 
0, on the GC content of the sequence (points). Continuous 
line shows a fitting with an exponential decay. T — 310 K, 

ythrea = 1-5 A. 



the formation of bubbles is a GC-dominated process, as 
AT base pairs are not in average free to melt without GC 
opposition. But below 40% GC content, large AT regions 
are possible that form bubbles freely, hence dominating 
the bubble formation process. 

Figure [3] presents the average bubble length Lb, which 
is given by the total number of base-pairs in bubble states 
divided by the total number of bubbles: 

Lb depends strongly on the GC content, showing an ex- 
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FIG. 4: Temperature dependence of the parameters of distri- 
bution, Eq. ([21, for pure GC sequences. Left: exponent c, 
the line is a linear fit. Right: decay length ^, the line is a fit 
with a divergent behavior as (Tc — T)~^, where the melting 
temperature is predicted to be Tc = 365.5 K, in agreement 
with transfer integral calculations. 



ponential decay. This stresses the importance the se- 
quence has on the typical size of denaturation bubbles. 

A recent work reports exponential decay of the bubble 
length probability [l5| . Although this seems to contradict 
our present results, this is not the case because the study 
[ist considers only a restricted regime of lengths, extend- 
ing from Z = 3upto^ = 12 — 14. Therefore, a restricted 
portion of the distribution may appear as a seemingly 
straight line in a semilogarithmic plot, indicating expo- 
nential decay. However, if one looks for either smaller 
{I — 1, 2) or larger lengths, below or above this regime, 
the exponential law fails. On the contrary the formula 
(HI), or a stretched exponential law [2^, describes accu- 
rately the whole regime from Z = 1 up to several tens 
of base-pairs. This means that large bubbles are more 
probable than what it would have been implied by expo- 
nential decay. Earlier studies of a simplified version of 
the PBD model (using a linearized stacking interaction) 
have also shown a slower than exponential decay of the 



distribution function 27, 3 



Importantly, the Poland-Scheraga (PS) model of DNA 
melting predicts a bubble length distribution described 
by Eq.© [H HI, [s^. Hence our results establish 
a bridge between these two different theoretical ap- 
proaches, which could help to future better understand- 
ing of the principles underlying the models, and hence- 
forth a more effective modeling approach to DNA. In the 
PS model, the exponent c has a very important phys- 
ical meaning, as its value at the critical temperature 
(where ^ diverges and the distribution is given by a pure 
power-law) indicates the order of the melting transition: 
for c < 1 there is no phase transition, the transition is 
smooth for 1 < c < 3/2, second order for 3/2 < c < 2 and 
first order for c > 2. It is interesting to characterize this 



exponent also for the PBD model; work in this direction 
is in progress and a detailed temperature dependence of 
bubble length distributions will be presented elsewhere 
[37|. Independently of the model, for c > 2 the average 
bubble length Lb oc J^^-^iO remains finite as T — > T~, 
and it diverges for c < 2, thus presenting a discontinuous 
or a continuous transition, respectively. 

As an advance of our work in progress, in Figure [4] we 
depict a preliminary result for the exponent c and the de- 
cay length ^ obtained for pure GC sequences at different 
temperatures. As happens in the PS model [sJl, also here 
the exponent increases linearly with temperature below 
the critical temperature. The PS model also predicts that 
^ diverges as (T^ - T)-i Q. Using this to fit the C data, 
we obtain a value Tc(GC) = 365.5 K, in agreement with 
transfer integral calculations we have performed. Hence 
the predicted value of the exponent c at Tc is c = 1.94 
(compatible with the value c = 1.95 used in biological 
studies [38*1 ) : we would have a very sharp transition, but 
still not first order, if the PS scheme applies. But we 
have to be very careful: for AT at T = 310 K, Figure [H 
shows that c = 1.96. A transfer integral calculation yields 
Tc (AT) — 325.5 K, so the exponent at Tc is expected to 
be higher than 2. Work in this direction is necessary, but 
the possibility to speculate that not only the transition 
temperature but also the nature of the phase transition 
itself depends on the sequence, is too tempting to let it 
go. The role that the parameters of the model play in this 
has also to be carefully investigated. Interestingly, pre- 
vious work using the Peyrard-Bishop model without the 
anharmonic stacking interaction (p = 0) yields a value of 
c much lower than 2 27[. Thus, regardless whether the 
transition in the full PBD model is first order or not, the 
analysis of bubble distributions shows once again 
that the anharmonic stacking interaction is responsible 
for the sharp (c « 2) transition in the model. 

We are not aware of any experimental technique able 
to track individual bubbles in long sequences in order to 
study their size distributions. However, with study of the 
probabilities of complete melting of molecules of differ- 
ent sizes, the power law behavior we have described may 
appear. The probability of complete melting of a given 
sequence has already been experimentally measured fs^l 
using a novel technique based on hairpin quenching. This 
same technique, applied at constant temperature to se- 
quences of varying length but similar composition, should 
clarify how the formation of bubbles depends on their size 
below the melting temperature. However, it should be 
noted that this experiment measuring complete melting 
is not exactly equivalent to our study of bubble forma- 
tion: sequence, finite-size and boundary effects may play 
an important role. 

In summary, using the Peyrard-Bishop-Dauxois model 
we have shown that at physiological temperatures the for- 
mation of thermally induced bubbles of different sizes fol- 
lows a nonexponential distribution with long tails, due to 
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nonlinear interactions within base-pairs. The occurrence 
of these openings may have an important effect, which 
should be taken into account in biological processes in- 
volving the opening of double stranded DNA. Moreover, 
structural properties of the DNA molecule may also de- 
pend on the frequency of occurrence of these bubbles. 
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